Multiomics dynamic learning enables personalized diagnosis and prognosis for pancancer and cancer subtypes

Abstract Artificial intelligence (AI) approaches in cancer analysis typically utilize a ‘one-size-fits-all’ methodology characterizing average patient responses. This approach neglects the diverse conditions in the pancancer and cancer subtypes of individual patients, resulting in suboptimal outcomes in diagnosis and treatment. To overcome this limitation, we shift from a blanket application of statistics to a focus on the explicit recognition of patient-specific abnormalities. Our objective is to use multiomics data to empower clinicians with personalized molecular descriptions that allow for customized diagnosis and interventions. Here, we propose a highly trustworthy multiomics learning (HTML) framework that employs multiomics self-adaptive dynamic learning to process each sample with data-dependent architectures and computational flows, ensuring personalized, trustworthy and patient-centered cancer diagnosis and prognosis. Extensive testing on a 33-type pancancer dataset and 12 cancer subtype datasets underscored the superior performance of HTML compared with static-architecture-based methods. Our findings also highlight the potential of HTML in elucidating complex biological pathogenesis and paving the way for improved patient-specific care in cancer treatment.


INTRODUCTION
Unlocking the full potential of artificial intelligence (AI) in cancer analysis requires a departure from the conventional 'one-size-fits-all' approach. The concept of generalizing patient responses fails to acknowledge the complex and diverse conditions of individuals during diagnosis and treatment. This antiquated viewpoint inevitably leads to ineffective treatments, where success stories are overshadowed by disappointing outcomes. To truly revolutionize cancer analysis through AI, we must recognize and embrace the individuality of each patient.
Gone are the days when we could rely on generalized patient responses as a basis for treatment decisions. This outdated approach often leads to suboptimal outcomes, where some patients experience remarkable success while others face disappointing results. The future lies in AI-enabled personalized medicine [1], offering a solution to the uncertainties that plague cancer diagnosis and treatment planning. By shifting our focus from mere statistics and the law of averages, we can now delve into the intricate molecular-level multiomics data of each patient [2][3][4], pinpointing abnormalities with precision and tailoring treatment strategies accordingly [5].
This powerful tool is poised to significantly enhance the capabilities of clinicians, positioning them at the cutting edge of proactive and therapeutic interventions [6]. By harnessing the power of AI, healthcare professionals can predict the likelihood of disease onset, ensuring that intervention happens at the earliest possible stage. Beyond just detection, AI aids in tailoring treatments to individual patients, heralding the age of personalized medicine. This ensures that treatments are optimized for each patient's unique genetic makeup and health profile, leading to more effective treatment with fewer side effects. The integration of AI into medical practice brings with it a renewed sense of hope. As we delve deeper into the vast complexities of diseases like cancer, the promise of AI-driven solutions hints at a future where these formidable challenges can be comprehensively understood and addressed [7].
Our study proposes a highly trustworthy multiomics learning (HTML) framework (Supplementary Figure S1 available online at http://bib.oxfordjournals.org/), a multiomics dynamic learning framework that is meticulously designed to address the critical features and requirements essential for cancer analysis [8][9][10][11]. HTML offers efficiency, interpretability, and reliability in integrating multiomics data by processing each sample with data-dependent architectures and computational flows, opening new avenues for research and clinical practice. We introduced sample-adaptive feature selection and modality selection dynamic learning modules to allocate different weights to features and modalities according to the input information. Inspired by biological knowledge, we devised DNA methylation-guided attention and triple contrastive learning modules within HTML, enabling us to replicate hierarchical relationships and align representations of biological components. These approaches not only enhance our understanding of various genes and data modalities but also facilitate personalized diagnosis and treatment through a sample-adaptive process. Compared with other methods, HTML stands out with its superior performance, remarkable interpretability and biologically guided mechanisms, excelling in capturing intricate biological relationships and identifying biomarkers effectively.
To demonstrate the power of HTML, we construct a comprehensive dataset encompassing a 33-type pancancer dataset and 12 cancer subtype datasets. HTML surpasses previous state-of-the-art approaches, achieving an increase of 3.51% in pancancer classification accuracy and an increase of 6.8% in cancer subtype classification accuracy. Moreover, HTML proficiently classifies patients into high-risk and low-risk groups, accurately determining disease risk levels within individual cancer subtypes. Through feature dynamic weights, we identify 849 potential biomarkers across all cancer subtype datasets, many of which have already been highlighted in published articles for their significance in cancer research.
In the spirit of promoting collaboration and facilitating further scientific advancement, we have made the complete HTML package available online. This package includes tutorials, demo cases, training data, pretrained models and test results from our study, all accessible to the broader scientific community. The HTML package can be found at https://github.com/YuxingLu613/HTML.

A highly trustworthy multiomics learning model
In this paper, we present HTML, a highly trustworthy multiomics learning model explicitly designed for multiomics classification tasks (as depicted in Figure 1b). The proposed framework comprises five primary modules:
• An inner-omics feature dynamic module that self-adaptively selects informative features (Figure 1c),
• An interomics DNA methylation-guided attention module that appropriately integrates complementary information among diverse modalities (Figure 1d),
• A triple contrastive learning module that aligns the features of diverse modalities (Figure 1e),
• A modality dynamic module that quantifies the relative contribution of each modality toward the final classification results (Figure 1f),
• An uncertainty module for prediction and integration that employs robust statistical tools (Dirichlet distribution and Dempster-Shafer theory) to provide reliable predictions and enhance the explainability of the model (Figure 1f).
When all five modules of the proposed HTML architecture are integrated with each other, the resulting framework offers both high performance and significant explainability, making it an ideal multiomics integration model for clinical applications such as cancer diagnosis and survival analysis.
HTML is an end-to-end model that can effectively improve the performance of multiomics classification tasks, enabling personalized diagnosis and treatment and facilitating the discovery of novel biomarkers. Our extensive research has led us to firmly believe that HTML constitutes the foremost approach to multiomics integration in biomedical data classification, characterized by high performance, trustworthiness and sensitivity to individual variations.

Dataset
We constructed a large-scale pancancer classification dataset of 33 different cancer types and 12 cancer subtype classification datasets from TCGA to showcase the effectiveness of HTML (Methods). Three types of omics data, namely DNA methylation (DNA meth), mRNA expression (mRNA) and micro-RNA expression (miRNA; except for the pancancer dataset), were used to enable multiomics integration and classification, providing complementary information pertaining to cancers.
In our experiments, only samples with matched omics data were included (Supplementary Tables 1-3 available online at http://bib.oxfordjournals.org/). In light of the potential impact of noise and redundant features on the overall quality, we conducted preprocessing and feature preselection on each type of omics data independently. Supplementary Tables 2 and 3 (available online at http://bib.oxfordjournals.org/) display the precise feature quantity before and after preprocessing.

Experimental settings
We conducted a comparative analysis of the classification accuracy of HTML in relation to preexisting multiomics integration algorithms (Table 1 and Supplementary Table 4 available online at http://bib.oxfordjournals.org/) and performed a comprehensive set of ablation studies (Supplementary Table 5 available online at http://bib.oxfordjournals.org/) to demonstrate the crucial roles played by distinct components in HTML. The hyperparameter selection comparison for each dataset is shown in Supplementary Table 6 available online at http://bib.oxfordjournals.org/. To provide a more valid estimate of HTML's performance, we executed a 5-fold cross-validation and subsequently reported the mean and standard deviation (SD) of the results on the validation fold.
We employed a battery of metrics for the evaluation of the selected methods: accuracy, average F1 score weighted by support (F1-weighted), macroaveraged F1 score (F1-macro), average area under the receiver operating characteristic curve (AUROC) and average area under the precision-recall curve (AUPRC) for each classification task. It is worth mentioning that we have incorporated an uncertainty metric in HTML to convey the model's level of confidence in the classification results. All experiments were run on a single Tesla V100 GPU.
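The label-based metrics named above can be illustrated with a minimal NumPy sketch. It is not taken from the HTML package; the function names `classification_metrics` and `summarize_folds` are hypothetical, and the sketch assumes integer class labels and the population SD (`ddof=0`) when summarizing folds.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Return (accuracy, F1-macro, F1-weighted) for integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    f1 = np.zeros(n_classes)
    support = np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # Per-class F1; defined as 0 when the class never occurs or is never predicted.
        f1[c] = 2 * tp / denom if denom > 0 else 0.0
        support[c] = np.sum(y_true == c)
    accuracy = float(np.mean(y_true == y_pred))
    f1_macro = float(f1.mean())                                 # unweighted class mean
    f1_weighted = float(np.sum(f1 * support) / support.sum())   # weighted by support
    return accuracy, f1_macro, f1_weighted

def summarize_folds(values):
    """Mean and SD of one metric over cross-validation folds (ddof=0 assumed)."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std())

acc, f1_macro, f1_weighted = classification_metrics(
    [0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2], n_classes=3
)
```

F1-macro treats every class equally, whereas F1-weighted follows the class frequencies, so the two diverge on imbalanced subtype datasets.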

HTML outperforms existing methods in pancancer and cancer subtype diagnosis
Table 1: Performance of HTML under different cancer subtypes. To evaluate the effectiveness of HTML, we conducted experiments on 12 different datasets from TCGA and compared it with ten prevalent multiomics classification methods. Our results show that HTML outperforms all other methods in terms of accuracy, F1-macro, and F1-weighted scores on all datasets and achieves the best AUROC and AUPRC scores on most datasets. To ensure the reliability of our results, we applied 5-fold validation and calculated the mean value and standard deviation of each metric for each method. Conversely, when the probability distribution of a specific label is significantly high, it suggests that the model has a higher degree of confidence in its classification result and a more precise prediction. The contribution of each module in HTML was comprehensively addressed through the ablation study in Supplementary Table 4.

To evaluate HTML's performance in cancer-subtype classification, we compared the classification performance of HTML with the following ten existing classification algorithms (five are traditional machine learning models and five are deep neural networks):
• Naive Bayes (NB) [12].
All the above methods were trained with the same preprocessed data. All the evaluation metrics of the cancer subtype classification tasks are shown in Table 1, and all the evaluation metrics of the pancancer classification tasks are shown in Supplementary Table 4 available online at http://bib.oxfordjournals.org/.
As shown in Figure 1a and Supplementary Figure 2 available online at http://bib.oxfordjournals.org/, HTML exhibits outstanding performance in the 33-class pancancer classification task, achieving an overall accuracy of 93.34 and an F1-weighted score of 92.22. Notably, we observed that the majority of misclassifications occurred within the same cancer meta-type, such as colon adenocarcinoma (COAD) and rectal adenocarcinoma (READ), both of which are malignancies that manifest in the large intestine. Furthermore, a subset of ESCA samples is erroneously classified as HNSC, implying a potential correlation between these cancer types.
With regard to cancer subtype classification, HTML outperformed other methods on most cancer subtype classification datasets. However, some exceptional cases emerged. For instance, the AUROC of MOGONET exceeded that of HTML on the GBMLGG dataset by 1.56%, but HTML still performed better than MOGONET regarding all other metrics. Moreover, although HTML did not match SVM and MOGONET on every metric in the STAD and KIPAN datasets, it surpassed all other models across the remaining metrics. Notably, the data imbalance in the UCEC dataset, where the largest subtype contains ∼15 times more samples than the smallest subtype, may inherently introduce instability into the AUROC and AUPRC results, which may explain the relatively weak results of HTML in the UCEC dataset. Overall, HTML demonstrated a strong classification performance across the 12 cancer-subtype classification tasks while surpassing other state-of-the-art methods in accuracy, F1-macro and F1-weighted metrics.
Regarding individual datasets, our HTML model achieved high cancer subtype classification results in some relatively straightforward binary classification tasks, such as ESCA, STES and KIPAN. However, the model faced challenges in tougher classification tasks such as GBMLGG and STAD. In the GBMLGG dataset, the model may have experienced difficulty distinguishing and classifying the three subtypes of astrocytoma (ASTRO), oligodendroglioma (ODG) and oligoastrocytoma (OA) due to their common origin from glial cells of the brain and spinal cord. Similarly, in the STAD dataset, there existed ambiguous relationships between the labels adenocarcinoma (ADC) and intestinal adenocarcinoma (IAC), wherein IAC is a distinct ADC subtype found in the mucous membrane of the intestines and thus forms a subset of ADCs. This complex label relationship may have contributed to the model's difficulties in accurately classifying these subtypes.

HTML predicts accurate cancer subtype prognosis
HTML is an adaptable framework that can be proficiently utilized for cancer prognosis tasks. To elucidate this point, we chose four datasets, namely, GBMLGG, UCEC, COADREAD and SARC, to predict patient cancer development across diverse cancer subtypes. The Kaplan-Meier curves for risk prediction in these patients are presented in Figure 2b.
Our analysis enabled us to effectively differentiate between high-risk and low-risk cohorts for each cancer subtype across all datasets. The low-risk group is represented by a solid line, while the high-risk group is denoted by a dotted line. For example, in the GBMLGG dataset, patients categorized as low-risk for oligodendroglioma (ODG), astrocytoma (AST) and oligoastrocytoma (OAC) demonstrated significantly superior cancer progression outcomes relative to their high-risk counterparts, and this difference was most pronounced in OAC subtypes, where the P value between low-risk and high-risk groups was <0.01. Similarly, in the UCEC dataset, patients with serous endometrial adenocarcinoma (SEAC) and mixed serous and endometrioid (MSEAC) were effectively stratified based on cancer progression risk. However, the prognostic performance for endometrioid endometrial adenocarcinoma (EEA) patients is not statistically significant, which is possibly due to a dearth of available samples. Moreover, the results from the COADREAD and SARC datasets further reinforce HTML's prognostic capabilities. Taken together, these findings provide compelling evidence of HTML's ability to accurately predict cancer progression in patients.
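For readers unfamiliar with Kaplan-Meier curves, the estimator behind such plots can be sketched as follows. This is a generic textbook implementation, not the code used to produce Figure 2b; the function name `kaplan_meier` and the toy follow-up data are illustrative assumptions, and how patients are split into risk groups is not specified here.

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up time of each patient
    events : 1 if the event (e.g. progression) was observed, 0 if censored
    Returns the distinct event times and the survival probability after each.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                  # still under observation at t
        d = np.sum((times == t) & (events == 1))      # events occurring at t
        s *= 1.0 - d / at_risk                        # product-limit update
        survival.append(s)
    return event_times, np.array(survival)

# Toy cohort: five patients, two of them censored (hypothetical values).
t, s = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
```

Computing one such curve for the low-risk group and one for the high-risk group gives the pair of lines compared in each panel; a log-rank test is the usual way to obtain the P value between them.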

HTML enables personalized diagnosis and treatment
Intuitively, (i) the variation of impact for different features within a given sample on the model's results should be distinct. (ii) The variation in attention given to a particular feature across different samples should be distinct. (iii) The variation in attention given to each feature during different training stages for a given sample should be distinct. Most existing encoder models learn a single set of weights for all inputs, which constrains their ability to fully explore the inherent characteristics of the features and provide understandable interpretations of the original data. To overcome this limitation, we introduced the feature dynamic concept, which adaptively learns feature dynamic weights based on the feature's characteristics and distribution. In contrast to data preprocessing, which aims to convert raw data into an analysis-ready format by eliminating any anomalies or inconsistencies present in the dataset, feature engineering is devised to allot varying weights to different features across individual samples, enabling the model to focus intently on the most informative features of the dataset.
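The idea of sample-dependent feature weights can be made concrete with a small sketch. The gating form used here (a sigmoid of a linear map of the input) is an assumption chosen for illustration; the exact parameterization of HTML's feature dynamic module is not described in this section, and `W`, `b` and `feature_dynamic_weights` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_dynamic_weights(x, W, b):
    """Per-sample feature weights in (0, 1): sigmoid of a linear map of the input."""
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

n_samples, n_features = 4, 6
x = rng.normal(size=(n_samples, n_features))              # toy omics features
W = rng.normal(scale=0.5, size=(n_features, n_features))  # hypothetical gate parameters
b = np.zeros(n_features)

weights = feature_dynamic_weights(x, W, b)    # shape (n_samples, n_features)
reweighted = x * weights                      # each sample is scaled by its own weights
top_feature_per_sample = weights.argmax(axis=1)
```

Because the weights are a function of the input, two samples generally receive different weights for the same feature, in contrast to a static encoder whose weights are shared by all inputs.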
Based on the analysis of the multiomics data, HTML can be used not only for the diagnostic and prognostic prediction of pancancer or cancer subtypes but also for self-adaptive assignment of cancer-related genes in specific samples through dynamic learning, thereby achieving dynamic monitoring of cancer progression and providing real-time guidance for individual precision treatment. In the randomly selected 66th sample (Figure 2c), HTML assigned different dynamic weights to the genes associated with cancer subtypes. Among the genes in this sample, DNA methylation of B-cell lymphoma-2-like protein 1 (BCL2L1) and hepatocyte nuclear factor 1 homeobox A (HNF1A) were given the highest feature dynamic weights. As a member of the antiapoptotic Bcl-2 family, BCL2L1 is hypomethylated and highly expressed in chemotherapy drug (e.g. cisplatin and paclitaxel)-resistant tumor cells in vitro and has been proven to be associated with drug resistance and recurrence of solid tumor cancer [20][21][22][23]. In contrast, aberrant hypermethylation of HNF1A downregulates the expression of UDP-glucuronosyltransferase 1A1 (UGT1A1), which remodels drug metabolism and transport pathways, locally inactivating anticancer drugs by glucuronidation in colon cancer cells [24]. In terms of mRNA expression, microtubule affinity regulating kinase 4 (MARK4) and Laminin Subunit Gamma 2 (LAMC2), with the highest dynamic weights in sample 66, may become potential high-efficiency targets for cancer treatment. MARK4 can promote the proliferation and migration of cancer cells by inhibiting Hippo signaling, the targeted inhibition of which is a strategy to treat cancers [25]. Conversely, its absence restricts the tumorigenicity of cancer cells [26]. Upregulated expression of LAMC2, an indicator of progression in esophageal squamous cell carcinoma (ESCC), is caused by lncRNA Cancer Susceptibility 9 (CASC9) interacting with CREB-binding proteins and is common in patients [27]. LAMC2 and CASC9 have been shown to be important biomarkers for metastasis therapy and the prognosis of ESCC, confirming the potential for HTML to play a role in cancer diagnosis and treatment [28]. HTML can also dynamically evaluate miRNA expression based on multiomics data analysis. The sponging of miR-1304 has been studied in many types of cancer, and this phenomenon, which is regulated by multiple circRNAs, leads to cancer proliferation and invasion [29][30][31]. However, miR-1307 is involved in tumor development by influencing target genes, such as Forkhead box O3a (FOXO3A), SET And MYND Domain Containing 4 (SMYD4) and Disabled homolog 2-interacting protein (DAB2IP), to further promote cancer invasion [32][33][34]. These genes, which were given the highest dynamic weight, may be potential targets for precision cancer therapy. We also extracted feature dynamic weights from three different genes in each sample shown in Figure 2c, with variations being exhibited across different samples. Given the sensitivity of HTML to individual differences, cancer diagnosis and treatment will likely be more personalized and effective in the future.
Similar to feature dynamic learning, the quality and noise level of data across modalities and samples vary. Therefore, the contribution of each modality to the final outcome should be modulated, and the model's confidence in each modality should align with its data quality. As a solution, we devised a modality dynamic learning module that imparts a confidence rating for each modality. As illustrated in Figure 2d, the modality dynamic weights for different modalities across various datasets are visualized with individual data points representing each sample. Our model reveals that miRNA plays a primarily subordinate role in the classification of most datasets, with the exception of ROSMAP, which aligns with the well-established association between various miRNAs and Alzheimer's disease. For example, Cogswell et al. [35] reported that the expression level of hsa-miR-423 underwent significant changes in the hippocampus and medial frontal gyrus of early- and late-stage AD patients compared with control samples. Some studies also identified an overlap in the expression patterns of specific miRNAs (hsa-miR-155, hsa-miR-126a, hsa-miR-23a, hsa-miR-34a, hsa-miR-9, hsa-miR-27a and hsa-miR-146a) in the retina of a rat model of age-related macular degeneration (AMD) (αβ intravitreal injection) and in the serum of AMD patients. These miRNAs are also recognized as potentially useful biomarkers of AD pathology [36][37][38].
It is worth noting that our analysis of unimodality data classification (Supplementary Table 7 available online at http://bib.oxfordjournals.org/ and Figure 2e) revealed that higher classification accuracy was achieved for mRNA expression than for DNA methylation and miRNA. This could be attributed to the fact that among these three modalities, mRNA expression is most relevant to the phenotype of the corresponding cancer. However, in our analysis of modality dynamics, we observed that weights for DNA methylation were comparatively higher than those for mRNA expression in the SARC, COADREAD and BRCA datasets and took the leading positions in all other datasets. This finding provides valuable insight into the critical upstream role that DNA methylation may play in the biological processes and pathogenesis of these cancers. It is important to note that this could also be attributed to HTML's alignment process in the methylation-guided attention mechanism.

HTML provides aligned representations and interpretable predictions
The various omics data, analogous to different biological processes, should adhere to the central dogma of molecular biology, whereby mRNA and miRNA expression is regulated by the DNA methylation process. Consequently, it is crucial to harmonize the representation of mRNA and miRNA expression with that of DNA methylation. In accordance with this tenet, we developed a guided attention module (Figure 1d). To exemplify the interconnectivity among genes, we randomly selected an interomics attention map (first 20 genes in each modality) from the SARC dataset of a patient with leiomyosarcoma (LMS). As illustrated in Figure 1a, diverse DNA methylation spots may exhibit varying attention weights toward mRNAs and miRNAs, thus mirroring the intricate interrelationships among these genes. Moreover, we employed the concept of skip connections [39] to facilitate enhanced generalization capacity and superior performance after the guided attention module (Figure 1b).
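A guided attention step of this kind can be sketched with standard scaled dot-product cross-attention, in which DNA methylation embeddings act as queries over mRNA (or miRNA) embeddings. This is an illustrative assumption rather than the module's actual implementation: learned projections are omitted, the embedding size is arbitrary, and attaching the skip connection to the methylation input is one possible reading of the text.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(meth, other):
    """Cross-attention guided by DNA methylation.

    meth  : (n_meth, d) methylation embeddings, used as queries
    other : (n_other, d) mRNA or miRNA embeddings, used as keys and values
    Returns the attended output with a skip connection and the attention map.
    """
    d = meth.shape[1]
    attn = softmax(meth @ other.T / np.sqrt(d), axis=1)   # (n_meth, n_other)
    attended = attn @ other                               # (n_meth, d)
    return attended + meth, attn                          # skip connection (assumed)

rng = np.random.default_rng(0)
meth = rng.normal(size=(20, 8))   # first 20 methylation features, toy embeddings
mrna = rng.normal(size=(20, 8))   # first 20 mRNA features, toy embeddings
out, attn_map = guided_attention(meth, mrna)
```

Each row of `attn_map` is a distribution over mRNA features for one methylation site, which is the quantity an interomics attention map visualizes.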
Figure 4: (a) To validate the effectiveness of the uncertainty metrics, we conducted an experiment where we added Gaussian noise with varying degrees to each dataset and observed the classification performance of HTML. The results are plotted on a graph, where three lines represent accuracy, AUROC (AUC), and uncertainty respectively. To ensure the reliability of our results, we calculated the 95% confidence interval of the results based on 5-fold cross-validation and represented it with shaded areas. The results showed that as the noise level in the data increased, the classification performance of the model continued to decline, accompanied by a continuous increase in the model's uncertainty. This trend was observed in all 12 datasets. (b) We investigated the expression difference of STC1 between normal and COAD as well as that between normal and READ. Expression alteration is absent in normal and COAD samples but exists in normal and READ samples (P value = 0.0019). (c) We investigated the expression differences in STRN4 between normal and COAD tissues as well as between normal and READ tissues. Expression alteration is absent in normal and READ samples but exists in normal and COAD samples (P value = 0.0363).

Multiomics learning incorporates a weakly supervised learning assumption, where each modality of multiomics data pertaining to a particular sample, such as DNA methylation, mRNA expression, and miRNA expression, represents a distinct biological perspective. An inherent correlation exists among the different modalities, and they can be mapped to an overall label separately. To increase the accuracy of the overall task, representation vectors for each modality of a given sample should be as close as possible. However, current contrastive learning methods [40,41] (Figure 3c) consider little about different elements belonging to the same cluster. They tend to choose negative samples randomly from the entire dataset, which can drive away the representation of different elements of the same cluster. To address this challenge, we proposed a triple contrastive learning (TCL) method (Figure 1e). We further compared the cosine similarities among representations of different modalities with and without contrastive learning in Figure 3d. We selected the first 50 samples from the STES dataset for display. The value of each point in Figure 3d is the result of the average cosine similarity among the three modalities' representation vectors in the sample. It can be seen that in most cases (except for one sample), the vector representations of the three modalities became more concentrated and their similarity to each other was greatly increased after being trained with TCL, where the most evident one changed from the similarity of 0.0216 to 0.3424.
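The per-sample quantity plotted in Figure 3d, the average cosine similarity among the three modality representations, can be computed as in the following sketch. The function name and the toy vectors are illustrative assumptions; the real representation vectors come from the trained encoders.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all pairs of representation vectors."""
    sims = []
    for a, b in combinations(vectors, 2):
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))

# Toy representations of one sample in three modalities (hypothetical values).
meth_vec = np.array([1.0, 0.0])
mrna_vec = np.array([1.0, 0.0])
mirna_vec = np.array([0.0, 1.0])
score = mean_pairwise_cosine([meth_vec, mrna_vec, mirna_vec])
```

With three modalities there are three pairs, so a value near 1 means all three representations point in nearly the same direction, whereas a value near 0 means they are close to orthogonal.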
The Softmax operator is widely employed in determining the classification evidential probability [42]. This involves normalizing the neural network output using $\sigma(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}$ and subsequently selecting the class with the maximum confidence probability. Nonetheless, the conventional approach to Softmax operation places inadequate focus on the overall probability distribution of classification results, with emphasis solely on the class with the highest probability [43]. Notably, the class probabilities other than the highest one can actually be informative with respect to classification outcome. Because considering the entire probability distribution of classification results is crucial for gaining comprehensive insight into the problem under investigation, we introduced the reliability of the classification results.
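A short numerical example shows why the full distribution matters. The two logit vectors below are made-up values; both select the same class, but the spread of the remaining probabilities differs substantially.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D vector of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # subtracting the max avoids overflow
    return e / e.sum()

confident = softmax([4.0, 1.0, 0.5])   # one class clearly dominates
hesitant = softmax([1.2, 1.0, 0.9])    # classes are nearly tied

same_decision = confident.argmax() == hesitant.argmax()
```

An argmax-only decision rule treats these two predictions identically, although the second one carries far less support for the chosen class.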
Additionally, other late fusion strategies simply concatenate the classification results for each modality or apply average pooling.HTML applies the Dempster-Shafer Theory of Evidence, which follows a trustworthy multimodal integration (TMI) rule for multimodal fusion and uncertainty calculation.The goal of TMI can be summarized in three points: (i) A modality's classification result should be referenced to its confidence score; (ii) When the uncertainty of all modalities is high, the final prediction must be of low certainty, and vice versa; (iii) The final prediction and uncertainty score should be considered together and dynamically change according to each modality's predicted outcome.
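The behaviour required by the TMI rule can be illustrated with the reduced Dempster combination rule commonly used in Dirichlet-based evidential classification. The sketch below assumes that each modality outputs non-negative per-class evidence; whether HTML uses exactly this formulation is an assumption, and the function names and evidence values are hypothetical.

```python
import numpy as np

def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty)."""
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.size
    alpha = evidence + 1.0          # Dirichlet parameters
    S = alpha.sum()                 # Dirichlet strength
    return evidence / S, K / S      # beliefs and uncertainty sum to 1

def combine(b1, u1, b2, u2):
    """Reduced Dempster's rule for two opinions over the same K classes."""
    conflict = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)   # mass on disagreeing pairs
    scale = 1.0 / (1.0 - conflict)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    u = scale * (u1 * u2)
    return b, u

# One confident modality and one uninformative modality (toy evidence).
b1, u1 = opinion_from_evidence([9.0, 0.0, 0.0])
b2, u2 = opinion_from_evidence([0.2, 0.1, 0.1])
b, u = combine(b1, u1, b2, u2)

# Two modalities with no evidence at all remain fully uncertain after fusion.
b0, u0 = opinion_from_evidence([0.0, 0.0, 0.0])
b_none, u_none = combine(b0, u0, b0, u0)
```

In this toy case the fused opinion follows the confident modality and its uncertainty drops slightly below that of either input, while fusing two evidence-free modalities leaves the uncertainty at 1. A third modality can be fused by applying `combine` again to the result.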
To validate the efficacy of the uncertainty metrics, we visualized the changes in the performance of HTML using the in-/out-of-distribution (ID/OOD) samples. Here, we set the original samples as ID and the samples with added Gaussian noise as OOD. Specifically, we added Gaussian noise with varying standard deviations (i.e. s = 2 k, k = 0, 1, 2, ..., 10) to the test samples. We conducted 5-fold cross-validation on all ID and OOD samples for each dataset in the 12 datasets and calculated the mean value of classification accuracy, AUROC, and uncertainty score of HTML for each dataset under each distribution setting as well as their 95% confidence intervals, as presented in Figure 4. The information in Figure 4 effectively confirms the usefulness and validity of the uncertainty metrics. In all datasets, in-distribution data gain the highest classification performance and lowest uncertainty score. As the data distribution deviates to a greater extent, the classification performance of the model gradually deteriorates, while the uncertainty scores of the model's classification results continue to increase. We can infer that in situations where there is a significant amount of data noise, the model's ability to accurately and consistently predict and classify may be compromised. As such, the uncertainty metric can serve as a reliable compass in evaluating the credibility of the data distribution.

HTML identifies important biomarkers
Our research on subtype classification has led to the identification of a variety of DNA methylation, mRNA, and miRNA biomarkers that warrant further investigation for their potential role in the diagnosis and treatment of distinct cancer subtypes. A selection of noteworthy biomarkers is presented in Supplementary Tables 8-19 available online at http://bib.oxfordjournals.org/ and Table 2. To showcase the value of our findings, we chose certain biomarkers from the COADREAD dataset (comprising COAD and READ cancer subtypes) as examples.
Within the COAD biomarkers, LRMP has been identified as hypermethylated and downregulated in a subpopulation of highly mobile cells, known as MG cells, derived from a COAD cell line. These cells are distinguished by their migratory capacity and epithelial-mesenchymal transition attributes, which are indicative of the metastatic potential and malignancy of COAD. Tanaka et al. [44] conducted a study that demonstrated the significant impairment of MG cell migration with the demethylation and upregulation of LRMP and other genes, suggesting a correlation between LRMP methylation and COAD. This relationship is further supported by the findings of various other studies, including the identification of COAD marker genes through single-cell transcriptomic analysis [45] and gene expression analysis of COAD and normal samples via RT-PCR assay [46]. In addition, genes identified in mRNA expression features, such as TNKS1BP1 [47], which encodes a protein involved in telomere replication, and STRN4 [48], which encodes a scaffolding protein with multiple functions, have also been reported to play a role in COAD tumorigenesis. To further investigate this, we analysed the expression differences of STRN4 between normal and COAD samples using data from Snipstad et al. [49] and compared them to those between normal and READ samples using data from Zuurbier et al. [50], as shown in Supplementary Table 20 available online at http://bib.oxfordjournals.org/ and Figure 4b. We discovered a consistent result, with a significant expression alteration present in normal and COAD samples (P value = 0.03565) but absent in normal and READ samples. In terms of miRNA expression features, hsa-mir-101-2 was identified as one of the prognostic miRNA signatures for COAD in a study by Lv et al. [51]. The targets of this signature are predicted to enrich biological process terms, such as focal adhesion and transcription disorders in cancer.
In terms of biomarkers for READ, STC1 is a member of the secretory glycoprotein family and is involved in processes such as apoptosis and inflammation. Its association with READ has been reported [52], and its secretion by tumor stromal cells may contribute to READ metastasis by mediating PDGF receptor signaling [53]. We analysed the same dataset mentioned earlier to examine the differential expression of STC1 in normal and COAD samples or normal and READ samples, and the results are shown in Supplementary Table 21 available online at http://bib.oxfordjournals.org/ and Figure 4c. We found a significant difference in STC1 expression between normal and READ with a P value close to 0.001, which suggests that STC1 could be a biomarker for READ. Among the genes identified in the mRNA expression features, SPON1 is a coding gene that produces a product predicted to be secreted to the extracellular matrix and is involved in cell adhesion, a process that could be dysregulated in tumorigenesis. Supiot et al. [54] examined its expression before and after preoperative radiotherapy in READ patients and found a significant upregulation. Similarly, upregulation of hsa-mir-765 was found to occur in READ patients in response to neoadjuvant chemoradiotherapy [55]. These findings suggest that SPON1 and hsa-mir-765 could be potential biomarkers for READ. Moreover, TargetScan predicted possible target genes of hsa-mir-765, including PDX1 [56], KLK4 [57] and LHPP [58], all of which have been reported to correlate with READ.

DISCUSSION
In this study, we proposed a highly trustworthy multiomics learning (HTML) framework for personalized pancancer and cancer subtype diagnosis and prognosis. Through training on pancancer and cancer subtype datasets, HTML has the potential to seamlessly integrate multiple modalities of data, thereby enabling personalized medical diagnosis. This technology is poised to overcome the existing barriers between pancancer diagnosis and cancer subtype determination, thereby achieving a comprehensive diagnosis pipeline. This pipeline is capable of processing one or more pieces of omics information from patients. It initially predicts the presence of cancer and the specific type of cancer and subsequently identifies the specific subtype of the patient's cancer, which has the potential to integrate the entire diagnostic process into one model.
Moreover, conventional models are unable to conduct sample-adaptive analysis based on the unique data distribution characteristics of each sample and lack the ability to validate and verify the reliability of their own classification results. To address this challenge, we developed the HTML model, which leverages dynamic learning at both the feature and modality levels together with Dirichlet uncertainty learning. This model integrates information across diverse modalities of data to make accurate classification predictions. It also supports the identification of cancer-causing genes for individual patients, thereby making personalized treatment possible. HTML's feature dynamic weight mechanism assigns a unique weight to each feature. This approach enables us to identify potential biomarkers that can differentiate between various types of cancers and cancer subtypes, thereby laying the groundwork for further exploration of the biological mechanisms that underpin cancer development.
Despite utilizing only DNA methylation, mRNA expression and miRNA expression data for our multiomics classification tasks, our HTML framework is extensible and can be customized to accommodate various types of data (such as SNVs, copy number or even medical images and texts) by adjusting the model parameters accordingly. In this way, HTML represents a highly adaptable supervised multiomics classification framework with strong interpretability and extensibility.

Feature dynamic learning
The multiomics input can be formulated as a set of modalities, with each modality containing $n$ $k_i$-dimensional vectors. The quantity $n$ denotes the number of samples, and the feature volume $k_i$ varies across different modalities. The symbol $\mathbb{R}$ represents the set of real numbers.
Suppose we have a single omic data input $V_i = \{S_1, S_2, \ldots, S_m\}$, $S \in \mathbb{R}^{k_i}$, where each sample $S$ is represented by a $k_i$-dimensional vector $S = [s_1, s_2, \ldots, s_{k_i}]$. Our objective is to learn a feature dynamic module that generates a dynamic weight vector $W = [w_1, w_2, \ldots, w_{k_i}] \in \mathbb{R}^{k_i}$ that performs self-adaptive feature selection. We achieve this by applying a transformation to the input features to obtain the dynamic feature embedding $S_d$. In this transformation, $W$ is generated through an MLP network and has the same dimension as the input $S$, $\sigma$ represents the sigmoid function, and the sigmoid and hyperbolic tangent functions introduce nonlinearity into the learning process.
The concept of feature dynamics can effectively address the issue of overfitting, in which the model tends to fit the noise in the input data rather than the underlying patterns [59]. Through the adaptive selection of features based on their significance to the prediction task, the model can improve its performance when encountering new samples and can achieve a more meaningful interpretation of the input features.
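A minimal sketch of sample-wise feature gating may help make the idea concrete. It assumes, for illustration only, that the dynamic weights $W$ come from a one-hidden-layer MLP with a tanh activation and are applied as a sigmoid gate, $S_d = \sigma(W) \odot S$. The exact transformation used in HTML is not reproduced here, and the function names and toy parameters are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def feature_dynamic(sample, w1, b1, w2, b2):
    """Return (gated embedding, gate) for one sample.

    Assumption: W = MLP(sample) and S_d = sigmoid(W) * sample, element-wise.
    """
    hidden = [math.tanh(h) for h in linear(w1, b1, sample)]
    raw_weights = linear(w2, b2, hidden)      # W, same length as the sample
    gate = [sigmoid(v) for v in raw_weights]  # per-feature weight in (0, 1)
    embedding = [g * s for g, s in zip(gate, sample)]
    return embedding, gate


if __name__ == "__main__":
    # Toy parameters for a 3-feature input and a 2-unit hidden layer.
    w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
    b1 = [0.0, 0.0]
    w2 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
    b2 = [0.0, 0.0, 0.0]
    for sample in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
        print(feature_dynamic(sample, w1, b1, w2, b2)[1])
```

Because the gate is computed from the input itself, two samples generally receive different weights for the same feature, which is the sample-adaptive behaviour the module is meant to provide.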

Methylation-guided attention
Through the feature selection module, we obtained a better inner-omics feature representation $S_d$ for each sample. Considering the uniqueness of multiomics data and their biological meaning, DNA methylation levels largely affect the expression levels of mRNA and miRNA. Therefore, we proposed a DNA methylation-guided attention mechanism to model the interomics relationships among different omics data.
In HTML, the original DNA methylation input features serve as the query vectors $d = [d_1, d_2, \ldots, d_k]$, while the mRNA or miRNA input features $m = [m_1, m_2, \ldots, m_k]$ act as key and value vectors. Both $d$ and $m$ have been projected into the same vector space. The guided attention context vector $S_a = [c_1, c_2, \ldots, c_k]$ is then calculated from the attention weights $a_{ij}$ between $d_i$ and $m_j$, where $k$ represents the feature numbers of both vectors. The mRNA context vector $c_i$ is computed by the sum of all attention values on DNA methylation features as follows:

$$c_i = \sum_{j=1}^{k} a_{ij} m_j$$

The weights $a_{ij}$ are computed by applying a Softmax operation to the alignment scores as follows:

$$a_{ij} = \frac{\exp\left(\mathrm{score}(d_i, m_j)\right)}{\sum_{j'=1}^{k} \exp\left(\mathrm{score}(d_i, m_{j'})\right)}, \qquad \mathrm{score}(d_i, m_j) = \frac{d_i^{\top} W_a m_j}{\sqrt{k}}$$

where the alignment score function indicates how well the elements of the DNA features align with the mRNA features at each position, $\sqrt{k}$ is the scaling factor and $W_a$ is the multiplicative attention weight matrix.
After the feature dynamic module and methylation-guided attention, we obtained both the inner-omics feature representation $S_d$ and the interomics feature representation $S_a$. We then took the average of vectors $S_d$ and $S_a$ ($S$ for DNA methylation) to obtain the final representation vector $\hat{S} = \frac{1}{2}(S_d + S_a)$ for each omics.
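The attention step and the final averaging can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the HTML implementation: each feature is treated as a small embedded vector, the alignment score is taken to be the scaled multiplicative form $d_i^{\top} W_a m_j / \sqrt{k}$, and the names `guided_attention` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(d, m, w_a):
    """Methylation queries `d` attend over mRNA/miRNA keys and values `m`.

    d, m: lists of k feature vectors in the same embedding space.
    w_a:  square multiplicative attention matrix (assumed form).
    Returns (context vectors c_i, attention weights a_ij).
    """
    scale = math.sqrt(len(m))  # sqrt(k), with k the number of features
    dim = len(m[0])
    weights, context = [], []
    for d_i in d:
        scores = []
        for m_j in m:
            projected = [sum(row[c] * m_j[c] for c in range(dim)) for row in w_a]
            scores.append(sum(x * y for x, y in zip(d_i, projected)) / scale)
        a_i = softmax(scores)
        weights.append(a_i)
        context.append([sum(a * m_j[c] for a, m_j in zip(a_i, m)) for c in range(dim)])
    return context, weights


def fuse(s_d, s_a):
    """Final representation: element-wise average of S_d and S_a."""
    return [0.5 * (x + y) for x, y in zip(s_d, s_a)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    d = [[1.0, 0.0], [0.0, 1.0]]
    m = [[1.0, 0.0], [0.0, 1.0]]
    context, weights = guided_attention(d, m, identity)
    print(weights)
    print(fuse([1.0, 3.0], [3.0, 5.0]))
```

Each row of the attention matrix sums to one, so every context vector is a convex combination of the mRNA or miRNA value vectors.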

Triple contrastive learning
To address the interactions among multiomics data and improve the classification task accuracy, we proposed a triple contrastive loss (TCL) function for learning comparisons among multiple modalities. This loss function pulls together the representations $x_1$, $x_2$ and $x_3$ of all modalities (3 in our experiment) of a given sample while pushing away the representations $x_j$ of other samples. In this loss, $m$ is the number of samples, $\tau$ is a temperature parameter, $x_{3i+1}$ is the representation of the $i$-th DNA methylation feature, $x_{3i+2}$ is the representation of the $i$-th mRNA expression feature, $x_{3i+3}$ is the representation of the $i$-th miRNA expression feature, and $K$ is the number of negative samples. This process optimizes the representation vectors of each modality and improves the accuracy of the overall task.
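The following sketch shows one InfoNCE-style way to realise a contrastive loss over three modalities. It is an assumption made for illustration and not the TCL formula from HTML: every ordered pair of modalities within a sample is a positive pair, every representation from any other sample is a negative, and cosine similarity is divided by the temperature $\tau$. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triple_contrastive_loss(samples, tau=0.5):
    """Average InfoNCE-style loss over all within-sample modality pairs.

    samples: list of (x_methylation, x_mrna, x_mirna) representation triples.
    Assumption: negatives are all representations of all other samples.
    """
    total, count = 0.0, 0
    for i, triple in enumerate(samples):
        negatives = [x for j, other in enumerate(samples) if j != i for x in other]
        for a in range(3):
            for b in range(3):
                if a == b:
                    continue
                pos = math.exp(cosine(triple[a], triple[b]) / tau)
                neg = sum(math.exp(cosine(triple[a], x) / tau) for x in negatives)
                total += -math.log(pos / (pos + neg))
                count += 1
    return total / count


if __name__ == "__main__":
    aligned = [([1, 0], [1, 0.1], [0.9, 0]), ([0, 1], [0.1, 1], [0, 0.9])]
    swapped = [([1, 0], [0.1, 1], [0.9, 0]), ([0, 1], [1, 0.1], [0, 0.9])]
    print(triple_contrastive_loss(aligned), triple_contrastive_loss(swapped))
```

In this toy setting the loss is lower when the three modality representations of a sample point in the same direction than when the mRNA representations are swapped between samples, which is the alignment behaviour the loss is designed to reward.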

Modality dynamic learning
As with feature dynamic learning, the contribution of individual features varies in each modality toward the final classification outcome. This holds true for disparate modalities as well. To effectively account for each modality's contribution to the final classification, HTML introduces a modality dynamic learning module to compute the model's confidence (MCC) for different modalities.
Take the DNA methylation feature $\hat{S} = [\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_k]$ as an example. Here, $W_{methyl} = [w_1, w_2, \ldots, w_k] \in \mathbb{R}^{k}$ is the weight parameter for DNA methylation features of the modality dynamic learning module, and each MCC function returns a confidence score for each modality feature of each sample.

Dirichlet uncertainty learning
The Dirichlet distribution is a prevalent multivariate probability model used for random variable synthesis [60]. The probability density function of the Dirichlet distribution is given by:

$$D(\mu \mid \alpha) = \begin{cases} \dfrac{1}{B(\alpha)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, & \mu \in U_K \\ 0, & \text{otherwise} \end{cases}$$

where $B(\alpha)$ represents the $K$-dimensional multinomial beta function parameterized by $\alpha$, and $U_K$ denotes the $(K-1)$-dimensional unit simplex, defined as:

$$U_K = \left\{ \mu \;\middle|\; \sum_{k=1}^{K} \mu_k = 1, \; 0 \le \mu_k \le 1 \right\}$$

It is noteworthy that each element in $\alpha$ corresponds to a different category concentration, and $\alpha$ is $K$-dimensional. From the definition of the Dirichlet distribution, we inferred that each vector of category concentrations parameterizes a corresponding Dirichlet distribution. By associating the $\alpha$ vector with the input class probabilities, one obtains a full perspective of the overall distribution of class probability beyond simply relying on the highest classification probability.
We introduce the Dempster-Shafer Theory of Evidence (DST) [61], which utilizes belief mass to allocate subjective probabilities in describing an opinion's credibility, in order to assign an uncertainty score for each modality [19]. A belief mass indicates how reliable the predicted label is, while the overall uncertainty denotes how much the total probability of the class should be questioned. The belief mass and overall uncertainty can be quantified through:

$$b_k = \frac{e_k}{S}, \qquad u = \frac{K}{S}, \qquad S = \sum_{i=1}^{K} (e_i + 1)$$

where $K$ represents the number of classification types, and $e_k$ indicates supporting evidence for the belief mass. In the case of multitype classification, the evidence $e_k$ is supported by the output from the neural network. The belief mass function and the overall uncertainty are related as follows:

$$u + \sum_{k=1}^{K} b_k = 1$$

This supports the notion that the belief mass and overall uncertainty interact complementarily. In our experiments, the Dirichlet distribution allowed modelling of secondary probability and uncertainty beyond relying solely on the highest probability as the selected class [43]. Utilizing the probabilistic output $e_k$ as evidence from the neural network, we defined the Dirichlet distribution parameters $\alpha_k = e_k + 1$ to ensure that the value of $\alpha$ satisfies the mathematical requisite that $\alpha_k - 1$ in the Dirichlet distribution must be nonnegative.
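The mapping from evidence to an opinion is short enough to write out. The sketch below assumes the standard subjective-logic form $b_k = e_k / S$ and $u = K / S$ with $S = \sum_k (e_k + 1)$, which is consistent with $\alpha_k = e_k + 1$ and with belief masses and uncertainty summing to one. The function name is hypothetical.

```python
def opinion_from_evidence(evidence):
    """Convert non-negative class evidence into (belief masses, uncertainty, alpha).

    Assumes b_k = e_k / S and u = K / S, where S is the sum of alpha_k = e_k + 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]        # Dirichlet parameters
    strength = sum(alpha)                      # Dirichlet strength S
    belief = [e / strength for e in evidence]  # per-class belief mass
    uncertainty = num_classes / strength       # mass left unassigned
    return belief, uncertainty, alpha


if __name__ == "__main__":
    for evidence in ([9.0, 0.5, 0.5], [0.0, 0.0, 0.0]):
        belief, uncertainty, alpha = opinion_from_evidence(evidence)
        print(evidence, belief, uncertainty, alpha)
```

With no evidence at all, every belief mass is zero and the uncertainty equals one. As evidence accumulates for one class, its belief mass grows and the uncertainty shrinks.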

Dempster-Shafer multiomics integration
Once the determination of the belief mass, overall uncertainty and evidence is complete, a quantitative measure of the reliability of opinions can be obtained. Nevertheless, in multiple modality tasks, integrating the belief mass and overall uncertainty from various sources proves to be a challenging task. The integration rule should consider the uncertainty value instead of simply reconciling the belief mass values from different sources. Additionally, multimodal integration must account for conflicting opinions across diverse modalities. Given these requirements, we proposed the following fusion rule. The integrated uncertainty value is calculated as follows:

$$u = \frac{1}{1 - C}\, u^1 u^2, \qquad C = \sum_{i \neq j} b_i^1 b_j^2$$

The variable $C$ represents a measure proposed to assess the conflicting results of different opinions. In cases where opinions are at odds with each other, the sum of the belief mass product is calculated. As the frequency of conflicting scenarios increases, the value of $C$ rises, resulting in an increase in the integrated uncertainty value, leading to a less persuasive integrated result. Concerning the fused belief mass, we set the integration rule as follows:

$$b_k = \frac{1}{1 - C}\left( b_k^1 b_k^2 + b_k^1 u^2 + b_k^2 u^1 \right)$$

The cross-relation between the belief mass and the overall uncertainty distributes different weights for distinct modalities linked to their respective uncertainties. For instance, when the uncertainty value of modality 1 is higher than that of modality 2, the integrated $b_k$ may rely more heavily on the input of $b^2$ thanks to the introduction of $b_k^2 u^1$, and vice versa. The DST integration rule can be easily extended to multiomics scenarios where the number of omics exceeds 2.
If we use $\mathcal{M}$ to denote both the belief mass and the overall uncertainty, $\mathcal{M} = \{\{b_k\}_{k=1}^{K}, u\}$, we can summarize the integration process between two modalities $\mathcal{M}^1$ and $\mathcal{M}^2$ as:

$$\mathcal{M} = \mathcal{M}^1 \oplus \mathcal{M}^2$$

where the operator $\oplus$ represents the integration process. This integration rule is commutative and associative, so the fusion result is independent of the order of fusion and can be applied to more than two modalities as follows:

$$\mathcal{M} = \mathcal{M}^1 \oplus \mathcal{M}^2 \oplus \cdots \oplus \mathcal{M}^M$$

The loss function for DST integration methods comprises the cross-entropy loss in each modality, $L_{single}$. We formulated our objective function for HTML by combining the triple contrastive loss, individual cross-entropy loss, and integrated loss. To further enhance the classification results, we incorporated uncertainty metrics and an $L_1$ regularization term to mitigate overfitting. This comprehensive objective function aims to reduce uncertainty and improve the accuracy of the classification results:

$$L_{overall} = L_{TCL} + L_{single} + L_{integrate} + u + 0.0001 \cdot \lVert w \rVert_1$$
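The fusion of opinions across modalities can be illustrated with a small sketch. It assumes the reduced Dempster combination rule commonly used in evidential multi-view classification: the conflict $C$ sums products of belief masses assigned to different classes, and both the fused belief and the fused uncertainty are rescaled by $1/(1 - C)$. The names `combine` and `fuse_all` are hypothetical.

```python
from functools import reduce


def combine(opinion_a, opinion_b):
    """Fuse two opinions, each given as (belief masses, uncertainty).

    Assumed rule:
        C   = sum over i != j of b_i^a * b_j^b
        b_k = (b_k^a * b_k^b + b_k^a * u^b + b_k^b * u^a) / (1 - C)
        u   = (u^a * u^b) / (1 - C)
    """
    belief_a, u_a = opinion_a
    belief_b, u_b = opinion_b
    conflict = sum(
        ba * bb
        for i, ba in enumerate(belief_a)
        for j, bb in enumerate(belief_b)
        if i != j
    )
    if conflict >= 1.0:
        raise ValueError("opinions are in total conflict and cannot be fused")
    scale = 1.0 / (1.0 - conflict)
    belief = [
        scale * (ba * bb + ba * u_b + bb * u_a)
        for ba, bb in zip(belief_a, belief_b)
    ]
    return belief, scale * u_a * u_b


def fuse_all(opinions):
    """Fold the pairwise rule over any number of modalities."""
    return reduce(combine, opinions)


if __name__ == "__main__":
    methylation = ([0.7, 0.1], 0.2)
    mrna = ([0.2, 0.2], 0.6)
    mirna = ([0.1, 0.6], 0.3)
    print(combine(methylation, mrna))
    print(fuse_all([methylation, mrna, mirna]))
```

Because the rule is commutative and associative, `fuse_all` gives the same opinion, up to floating-point rounding, whatever the order of the modalities. A modality with high uncertainty contributes little to the fused belief, since its belief masses are multiplied by the small uncertainty of the more confident modality.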

Figure 1. Framework of HTML. (a) Illustration and application of our work. (b) The overall pipeline of the HTML model. HTML is an end-to-end framework that integrates multiomics data to perform cancer diagnosis and prognosis. (c) Dynamic learning on both features and modalities. The same feature may play diverse roles across samples, and it is crucial to allocate a samplewise adaptive weight to each feature rather than relying on a fixed weight vector. Similarly, varying weights should be assigned to modalities to indicate their distinct contributions to the final results. (d) DNA methylation-guided attention mechanism. DNA methylation can affect the expression levels of mRNA and miRNA. This guided attention mechanism reflects the interrelationships among different modalities and outputs a coordinated representation that integrates the biological information among modalities. (e) Triple contrastive learning module. Multiomics learning tasks are typically conducted in an unsupervised learning setting, where the data in each modality must be subject to the overall label of the sample. In this module, modality embeddings from a given sample are considered positive pairs, while those from different samples are considered negative pairs. This approach effectively aligns the representations of different modalities, facilitating the integration of multiple data sources. (f) Dirichlet uncertainty prediction. If the probability distribution of a classification result is evenly distributed, it may indicate that the model has a limited capacity to differentiate between various categories. Conversely, when the probability distribution of a specific label is significantly high, it suggests that the model has a higher degree of confidence in its classification result and a more precise prediction. The contribution of each module in HTML was comprehensively addressed through the ablation study in Supplementary Table 4.

Figure 2. HTML is a prominent model in multiomics learning. (a) We visualized the classification confusion matrix of HTML on the pancancer dataset, and the results demonstrate that HTML exhibits exceptional performance in pancancer classification. (b) We performed survival prediction tasks on the GBMLGG, UCEC, COADREAD and SARC datasets to identify the high-risk and low-risk groups. The predicted risk groups are divided by the cancer subtypes and visualized by Kaplan-Meier curves. The dashed line represents the high-risk group, and the solid line represents the low-risk group. The P-value is annotated in the graph, where * means P-value ≤ 0.05, ** means P-value ≤ 0.01, and 'ns' indicates not significant. (c) We illustrate the distribution of feature dynamic weights under different conditions. We randomly selected the 66th sample's feature dynamic weight in different modalities from the ESCA dataset (upper 3 charts) and found that their distribution was significantly different. The same phenomena were also found in the same feature dynamic weights across samples (lower 3 charts). (d) We extracted the modality dynamic weights' distribution in all datasets and determined that the modality dynamic weights are all related to the data distribution in the specific sample. (e) We tested all possible multiomics input combinations and examined their corresponding classification performance.

Figure 3. Representation alignment across different modalities.(a) We extracted the attention weight matrices (among the first 20 genes) from the methylation-guided attention module in the SARC dataset, which provides insights into the relationships and interactions between the genes.The upper matrix represents attention between DNA methylation and mRNA, and the lower matrix represents attention between DNA methylation and miRNA.The depth of colour in the matrix indicates the strength of these relationships.These insights can be useful in understanding the mechanisms behind gene regulation and identifying potential targets for therapeutic interventions.(b) We devised the triple contrastive loss (TCL) function to enhance the alignment of vector representations among different modalities within the same sample.In comparison to other contrastive learning loss functions, such as InfoNCE loss and the InfoNCE loss between triples, our TCL approach is more effective in aligning the representation of a given sample and increasing the distance between different samples.(c) We visualized the average cosine similarity among different modalities' representations, and we discovered that our TCL method leads to a more aligned representation of different modalities within the same sample.

Figure 4. Uncertainty analysis and biomarker identification.(a) Dirichlet uncertainty analysis of multiomics data with different levels of noise.To validate the effectiveness of the uncertainty metrics, we conducted an experiment where we added Gaussian noise with varying degrees to each dataset and observed the classification performance of HTML.The results are plotted on a graph, where three lines represent accuracy, AUROC (AUC), and uncertainty respectively.To ensure the reliability of our results, we calculated the 95% confidence interval of the results based on 5-fold cross-validation and represented it with shaded areas.The results showed that as the noise level in the data increased, the classification performance of the model continued to decline, accompanied by a continuous increase in the model's uncertainty.This trend was observed in all 12 datasets.(b) We investigated the expression difference of STC1 between normal and COAD as well as that between normal and READ.Expression alteration is absent in normal and COAD samples but exists in normal and READ samples (P value = 0.0019).(c) We investigated the expression differences in STRN4 between normal and COAD tissues as well as between normal and READ tissues.Expression alteration is absent in normal and READ samples but exists in normal and COAD samples (P value = 0.0363).

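In the per-modality cross-entropy loss $L_{single}$, the predicted probability of class $k$ in modality $m$ is written $b_k^m$, where $M$ is the number of modalities. The loss function for the final classification outcome, $L_{integrate}$, contains a cross-entropy loss built on the digamma terms $\psi(\alpha_i)$ and a KL divergence loss $KL(\alpha, \gamma)$, where

$$KL(\alpha, \gamma) = D_{KL}\left[ \mathrm{Dir}(\mu^m \mid \tilde{\alpha}^m) \,\middle\|\, \mathrm{Dir}(\mu^m \mid \mathbf{1}) \right], \qquad \tilde{\alpha}^m = y + (1 - y) \odot \alpha^m$$

Here $\tilde{\alpha}^m$ is the Dirichlet parameter vector after replacing the $\alpha_k$ corresponding to the ground truth label with 1, thus avoiding penalizing the Dirichlet parameter of the ground truth class to 1, and $\mathbf{1}$ is the all-ones vector of length $k$.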

Table 2: Important biomarkers identified by HTML. We listed the important biomarkers found by HTML for further investigation of each cancer subtype in 10 datasets. These biomarkers will serve as potential targets for future research and may lead to novel diagnostic or therapeutic approaches for cancer treatment.